K-Means Clustering - A Sparse Version of Principal Component Analysis

Introduction

K-Means algorithm is one of the popular clustering analysis techniques, that helps group similar data points into clusters and look for hidden insights. For example, by analyzing sales data, customers can be grouped into segments to help understand their behaviour. In one of the earlier posts, we explored Principal Component Analysis, a dimension reduction tool that transforms a number of (possibly) correlated columns into a (smaller) number of uncorrelated columns called principal components.


K-Means and PCA are thought of as two different procedures, having two different purposes and at first sight do not seem related. But that isn’t true!


While principal components can help rank the records of the dataset, k-means just groups the records into, for example, good/bad. Thus, K-Means is a sparse version of PCA. We explore this idea more with the European employment dataset by grouping similar countries. The Principal Components are then combined with the results of K-Means to see the similarities between the two techniques.

Data

The dataset contains the distribution of employment in different sectors of 26 European countries. There are ten variables, the first being the name of the country, and the remaining nine holding the percentage employment of the countries’ citizens in nine different industries.

The dataset seems to contain USSR and East Germany, so I’m assuming this comes from at least from or before the 80’s. Data is read in and a box plot of distributions in each industries in presented below.

K-Means Basics

K-Means clustering is one of the most commonly used unsupervised machine learning algorithms to group data into a pre-specified K clusters. The data points are assigned to the groups in such a way that the data points are as similar as possible. Each cluster is represented by a centroid, which is the mean value of all the data points in it. The within-cluster similarity is measured through the sum of squares of the residuals between the centroid and all other data points assigned to that centroid.

\[withinness(C_k) = \sum_{k\in C_k}{(x_{ik} - \mu_k)^2} \]
  • \(C_k\) is cluster K
  • \(x_{ik}\) is a data point belonging to a centroid K
  • \(\mu_k\) is the average of all data points in cluster K

For reference, this excellent post by Brad Boehmke goes into a lot more detail.

The K-Means algorithm is applied to the dataset with the clusters varying between 2 to 7. The countries are colored based on the cluster they are assigned to and presented below on the chloropleth. The number of clusters can be changed from the dropdown. For two clusters, the algorithm pits Turkey against the rest of the Europe. With seven clusters, we see the countries separated broadly into the central Europe, the Mediterranean, East and Scandinavia. Turkey and Hungary are in their own clusters.

# K-Means requires data to be scaled normally N(0,1).
# This is to make Euclidean distance measurement comparable.
data_scaled <- scale(data) %>% as.data.frame()

# Country code is combined to plot on map
country_code <- read_csv("country_code.csv") %>% 
  rename(country = "COUNTRY", code = "CODE")
data_kmeans <- data_scaled %>% rownames_to_column("country") %>% 
  left_join(country_code, by = "country") 

# Combining the results of K-Means algorithm
data_kmeans$k <- 0
data_kmeans$cluster <- 0
# Into geo_data
geo_data <- NULL

for (i in 2:7) {
  kmeans_model <- kmeans(data_scaled, i, nstart = 25)
  data_kmeans$cluster <- kmeans_model$cluster
  data_kmeans$k <- i
  geo_data <- geo_data %>% bind_rows(data_kmeans)
}

# Drop unnecessary columns
geo_data <- geo_data %>% select(country, code, k, cluster)

# Change data structure for Plotly
geo_data_played <- geo_data %>% spread(k, cluster, sep = "value")
geo_colnames <- colnames(geo_data_played)

# Plotluy Chloropleth settings
l <- list(color = toRGB("grey"), width = 0.5)
g <- list(showframe = FALSE, 
          # lonaxis = list(range = c(-20, 50)),
          # lataxis = list(range = c(30, 90)),
          lonaxis = list(range = c(-10, 50)),
          lataxis = list(range = c(30, 80)),
          showcoastlines = FALSE,
          projection = list(type = 'Mercator'))

# Build plot with data for two clusters visible
geo_plot <- geo_data_played %>% plot_geo() %>%
  add_trace(z = ~kvalue2, color = ~kvalue2, 
            text = ~country, locations = ~code, 
            marker = list(line = l), showscale = FALSE) %>% 
  colorbar(title = "Scale") %>%
  layout(geo = g,
         title = "Countries grouped based on employment",
         annotations = list(x = 0, y = 1, text = "Change clusters in dropdown",
                            showarrow = FALSE, xref = "paper", yref = "paper"))

# Add plots for clusters 3:7 invisible
for (col in geo_colnames[4:8]) {
  geo_plot <- geo_plot %>% 
    add_trace(z = geo_data_played[[col]], color = geo_data_played[[col]], 
              text = ~country, locations = ~code, 
              marker = list(line = l), showscale = FALSE, visible = FALSE) 
}

# Add dropdown for clusters and set corresponding visibility 
geo_plot %>%
  layout(
    updatemenus = list(
      list(y = 0.9, x = 0.08,
           buttons = list(
             list(method = "restyle",
                  args = list("visible", list(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)),
                  label = "2"),
             list(method = "restyle",
                  args = list("visible",list(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)),
                  label = "3"),
             list(method = "restyle",
                  args = list("visible", list(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)),
                  label = "4"),
             list(method = "restyle",
                  args = list("visible", list(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)),
                  label = "5"),
             list(method = "restyle",
                  args = list("visible", list(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)),
                  label = "6"),
             list(method = "restyle",
                  args = list("visible", list(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)),
                  label = "7")
           )))) %>% 
  config(displayModeBar = FALSE) 

As we already observed, the dataset contains East and West Germany, Yugoslavia and other countries which do not exist in the maps anymore. If you do not see 7 colors for 7 clusters, that’s because the missing countries are not being represented on the map

We clearly see patterns emerging that we didn’t know existed before. Countries that are geographically and culturally closer had similar employment patterns. Unsupervised learning thus helped us draw inferences without a labelled response in the input dataset.

Determining the optimal number of clusters

Because the data was on a map, it was easier for us to say that 7 clusters grouped the data better than 2. But in other cases where the output may not be easily interpretable, it becomes essential to be definite about the number of clusters. There are three major methods to determine K.

  • Elbow - The number of clusters is plotted against the within residual sum squares and the cluster where there is a sharp drop is considered optimal.
  • Average Silhouette - The silhouette value is a measure of how similar an data point is to its own cluster compared to other clusters. The optimal number of clusters is the one that maximizes the average silhouette.
  • Gap Statistic - This method computes the within-cluster variations and compares them against the clusters of another randomly generated reference dataset. The K that gives the highest difference is considered optimal.

The result of the Elbow method is ambiguous, while three clusters have the highest average silhouette. The gap statistic method suggests one cluster as the optimum, which is impractical.

Going by the Silhouette method, three clusters broadly separate the data into western and eastern Europe, leaving out Turkey and Yugoslavia into the third cluster. This seems fair, and the rest of the document will work with three clusters.

Merging Results with Principal Components

PCA model is applied to the data and the results are printed below. It’s observed that the first three components explain ~76% of the variance. We will be utilizing only the three components to make plotting possible.

While, PCA assigned each country to a point in a 3-dimensional plane, the role of K-Means is only to assign each country to one of the K clusters. Thus a K-Means model with n clusters, where n is the number of data points will essentially be a PCA.

Goes to show that PCA is one step on top of k-means

##                Agriculture Mining Manufacturing PowerSupply Construction
## Belgium                3.3    0.9          27.6         0.9          8.2
## Denmark                9.2    0.1          21.8         0.6          8.3
## France                10.8    0.8          27.5         0.9          8.9
## WGermany               6.7    1.3          35.8         0.9          7.3
## Ireland               23.2    1.0          20.7         1.3          7.5
## Italy                 15.9    0.6          27.6         0.5         10.0
## Luxembourg             7.7    3.1          30.8         0.8          9.2
## Netherlands            6.3    0.1          22.5         1.0          9.9
## United Kingdom         2.7    1.4          30.2         1.4          6.9
## Austria               12.7    1.1          30.2         1.4          9.0
## Finland               13.0    0.4          25.9         1.3          7.4
## Greece                41.4    0.6          17.6         0.6          8.1
## Norway                 9.0    0.5          22.4         0.8          8.6
## Portugal              27.8    0.3          24.5         0.6          8.4
## Spain                 22.9    0.8          28.5         0.7         11.5
## Sweden                 6.1    0.4          25.9         0.8          7.2
## Switzerland            7.7    0.2          37.8         0.8          9.5
## Turkey                66.8    0.7           7.9         0.1          2.8
## Bulgaria              23.6    1.9          32.3         0.6          7.9
## Czechoslovakia        16.5    2.9          35.5         1.2          8.7
## EGermany               4.2    2.9          41.2         1.3          7.6
## Hungary               21.7    3.1          29.6         1.9          8.2
## Poland                31.1    2.5          25.7         0.9          8.4
## Romania               34.7    2.1          30.1         0.6          8.7
## Russia                23.7    1.4          25.8         0.6          9.2
## Yugoslavia            48.7    1.5          16.8         1.1          4.9
##                ServiceIndustry Finance PersonalServices TransportAndComm
## Belgium                   19.1     6.2             26.6              7.2
## Denmark                   14.6     6.5             32.2              7.1
## France                    16.8     6.0             22.6              5.7
## WGermany                  14.4     5.0             22.3              6.1
## Ireland                   16.8     2.8             20.8              6.1
## Italy                     18.1     1.6             20.1              5.7
## Luxembourg                18.5     4.6             19.2              6.2
## Netherlands               18.0     6.8             28.5              6.8
## United Kingdom            16.9     5.7             28.3              6.4
## Austria                   16.8     4.9             16.8              7.0
## Finland                   14.7     5.5             24.3              7.6
## Greece                    11.5     2.4             11.0              6.7
## Norway                    16.9     4.7             27.6              9.4
## Portugal                  13.3     2.7             16.7              5.7
## Spain                      9.7     8.5             11.8              5.5
## Sweden                    14.4     6.0             32.4              6.8
## Switzerland               17.5     5.3             15.4              5.7
## Turkey                     5.2     1.1             11.9              3.2
## Bulgaria                   8.0     0.7             18.2              6.7
## Czechoslovakia             9.2     0.9             17.9              7.0
## EGermany                  11.2     1.2             22.1              8.4
## Hungary                    9.4     0.9             17.2              8.0
## Poland                     7.5     0.9             16.1              6.9
## Romania                    5.9     1.3             11.7              5.0
## Russia                     6.1     0.5             23.6              9.3
## Yugoslavia                 6.4    11.3              5.3              4.0
##                cluster
## Belgium              2
## Denmark              2
## France               2
## WGermany             2
## Ireland              2
## Italy                2
## Luxembourg           2
## Netherlands          2
## United Kingdom       2
## Austria              2
## Finland              2
## Greece               1
## Norway               2
## Portugal             2
## Spain                2
## Sweden               2
## Switzerland          2
## Turkey               3
## Bulgaria             1
## Czechoslovakia       1
## EGermany             1
## Hungary              1
## Poland               1
## Romania              1
## Russia               1
## Yugoslavia           3
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4    Comp.5
## Standard deviation     1.9021582 1.6584794 1.0993847 0.99968878 0.7523550
## Proportion of Variance 0.3618206 0.2750554 0.1208647 0.09993777 0.0566038
## Cumulative Proportion  0.3618206 0.6368760 0.7577407 0.85767843 0.9142822
##                            Comp.6     Comp.7     Comp.8      Comp.9
## Standard deviation     0.61924270 0.48131550 0.39032100 0.299425677
## Proportion of Variance 0.03834615 0.02316646 0.01523505 0.008965574
## Cumulative Proportion  0.95262838 0.97579484 0.99102989 0.999995462
##                             Comp.10
## Standard deviation     6.736270e-03
## Proportion of Variance 4.537734e-06
## Cumulative Proportion  1.000000e+00
## 
## Loadings:
##                  Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Agriculture       0.481  0.212                0.209  0.155              
## Mining           -0.104  0.474 -0.387        -0.215        -0.743       
## Manufacturing    -0.391  0.173 -0.228 -0.345 -0.413 -0.289  0.443  0.230
## PowerSupply      -0.268        -0.635  0.300  0.346  0.357  0.274 -0.258
## Construction     -0.337         0.221 -0.633  0.392  0.136 -0.268 -0.197
## ServiceIndustry  -0.282 -0.426               -0.242  0.612 -0.191  0.185
## Finance                 -0.456 -0.389 -0.170  0.392 -0.528 -0.126       
## PersonalServices -0.331 -0.280  0.266  0.455 -0.214 -0.264 -0.141 -0.490
## TransportAndComm -0.402  0.101  0.243  0.378  0.423 -0.117         0.651
## cluster           0.260 -0.465 -0.245        -0.169        -0.172  0.376
##                  Comp.9 Comp.10
## Agriculture              0.806 
## Mining                         
## Manufacturing            0.366 
## PowerSupply       0.201        
## Construction      0.384        
## ServiceIndustry  -0.414  0.238 
## Finance          -0.374  0.145 
## PersonalServices  0.190  0.351 
## TransportAndComm               
## cluster           0.674
##                     Comp.1      Comp.2       Comp.3      Comp.4
## Belgium        -1.26398437 -1.74452902 -0.001254644  0.30088036
## Denmark        -0.42319466 -2.13009597  1.267561841  0.68255043
## France         -0.37568334 -1.38641351 -0.211261881 -0.59531701
## WGermany       -0.66417801 -0.54606693 -0.705667544 -0.17071675
## Ireland         0.30795143 -0.56803651 -0.444936062  0.84560846
## Italy          -0.15737517 -0.78384936  1.239449014 -1.27738004
## Luxembourg     -1.00888676  0.02964664 -0.915910071 -0.86246689
## Netherlands    -1.11776190 -2.31783849  0.493832715 -0.07058767
## United Kingdom -1.29458951 -1.17511448 -1.318856435  1.07690996
## Austria        -0.99904082 -0.52319862 -1.104919156 -0.30639463
## Finland        -0.66825074 -1.16125689 -0.387187335  1.08772186
## Greece          1.66821725  1.10063920  1.421663839 -0.24578174
## Norway         -1.33616753 -1.43721163  1.245139034  1.11960192
## Portugal        1.17330445 -0.40684706  0.867566173 -0.62428062
## Spain           0.62456690 -0.54134131 -0.258016002 -2.62650851
## Sweden         -0.61866117 -1.80743659  0.529021065  1.04681985
## Switzerland    -0.74468773 -1.14058159 -0.259581241 -1.92887551
## Turkey          6.74912683  0.07295877  0.733406985  1.39203411
## Bulgaria        0.01052941  2.18481086  0.861122924 -0.21255162
## Czechoslovakia -1.25452582  2.66939595 -0.599113312 -0.12345973
## EGermany       -2.55119521  2.35217059 -0.802016982  0.69823690
## Hungary        -1.45418762  2.91662644 -1.588221915  1.18781814
## Poland          0.31319967  2.59746270  0.346285490  0.12767890
## Romania         1.25913757  2.62765892  0.377216957 -1.29492928
## Russia         -0.74978990  1.95132736  2.179866351  0.70615783
## Yugoslavia      4.57612673 -0.83287948 -2.965189807  0.06723130
##                     Comp.5      Comp.6      Comp.7      Comp.8      Comp.9
## Belgium        -0.26493349 -0.05031098 -0.30522353  0.19871028 -0.39551485
## Denmark         0.16472253 -0.82939769 -0.22554242 -0.45419677 -0.04681476
## France         -0.20433875  0.11067018 -0.08998681 -0.39634443 -0.11213003
## WGermany       -1.14108865 -0.62470643  0.45544698  0.20605008 -0.04837512
## Ireland         0.08990711  1.42160626  0.02147256 -0.35276220  0.07791895
## Italy          -0.82101099  1.00486930 -0.20483324  0.03564436  0.35256273
## Luxembourg     -1.01519509  0.22333677 -1.75693579  0.37315760 -0.18921222
## Netherlands     0.73124718  0.21147472 -0.21943217 -0.64577247  0.09407475
## United Kingdom -0.64390343 -0.04597553  0.17429240 -0.48561713 -0.10682667
## Austria         0.51605353  0.80576682  0.32279823  0.39873937  0.21543204
## Finland         0.59999201  0.08128513  0.66290541  0.14765721  0.07960810
## Greece          1.12682419  0.82643411  0.24964439  0.11510659 -1.06355289
## Norway          0.77431994 -0.03604708 -0.41728307  0.94648704  0.06089153
## Portugal       -0.17324753  0.51739112  0.40307387  0.10926017  0.20663939
## Spain           1.36869222 -0.78753102 -0.17555349 -0.14228175  0.43335863
## Sweden         -0.43163150 -0.87711688  0.17402495 -0.48152943 -0.07033578
## Switzerland    -0.58604676  0.17550895  1.01967829  0.48630810 -0.04168648
## Turkey         -1.26397102  0.23565361 -0.06675019  0.15525190  0.37681530
## Bulgaria       -0.62801419 -0.55941867  0.27980994 -0.01923775 -0.22023873
## Czechoslovakia -0.32086269 -0.11022522 -0.05339428 -0.20253305  0.15523434
## EGermany       -0.77169122 -0.54169934  0.31624542  0.47217416 -0.07718427
## Hungary         0.91674303  0.77415319 -0.01969164 -0.27755694  0.31282140
## Poland          0.32349143  0.02399788 -0.42162191 -0.25589146 -0.07605085
## Romania        -0.31414827 -0.30638122  0.12700769 -0.59510089 -0.14798054
## Russia          0.90006482 -0.78094985 -0.09898534  0.36225716  0.40991418
## Yugoslavia      1.06802559 -0.86238813 -0.15116625  0.30202025 -0.17936815
##                      Comp.10
## Belgium        -0.0014787407
## Denmark         0.0153670356
## France         -0.0007976758
## WGermany       -0.0064772992
## Ireland         0.0108685808
## Italy           0.0061906947
## Luxembourg      0.0036874581
## Netherlands    -0.0110757776
## United Kingdom -0.0057793140
## Austria        -0.0023301685
## Finland         0.0061163175
## Greece         -0.0041461928
## Norway         -0.0085464274
## Portugal        0.0032645068
## Spain          -0.0058784190
## Sweden         -0.0031452857
## Switzerland     0.0018930369
## Turkey         -0.0078639053
## Bulgaria       -0.0037678286
## Czechoslovakia -0.0104702057
## EGermany        0.0074496218
## Hungary        -0.0012881024
## Poland         -0.0018677731
## Romania         0.0069421421
## Russia          0.0062086523
## Yugoslavia      0.0069250692

K-Means in Matrix Form

K-Means algorithm is essentially an optimization problem where the objective function is to come up with centroids such that the residual sum squares is minimized. The only constraint is that each of the data points can belong to only one cluster.

In other words, K-means and PCA maximize the same objective function, with the only difference being that K-means has additional “categorical” constraint.

PCA is relaxed stuff Throw in a little matrix and images. Credit quora.

https://stats.stackexchange.com/questions/183236/what-is-the-relation-between-k-means-clustering-and-pca

https://www.quora.com/What-is-the-relationship-between-K-means-and-PCA

PCA is relaxed k means

For HOMEWORK

Harish M